21天可以養成一個好習慣, 但3天就可以放棄, 但我今天還是準時坐在這裡, 所以今天你的選擇是什麼呢?
我們今天要來討論的是,選擇
阿不是啦是select()
其實好像也差不多,總而言之,這是一個可以用來選取你需要的column的function
Let's Start!
select()
相信有寫過SQL的大家都對這個不是很陌生,對的,就是一樣的概念,pyspark 的select()
, 跟SQL SELECT其實是一樣的,如果你是個非常熟悉資料庫的人,可以果斷地跳過直接前往下一篇XD
上一篇中有簡單介紹了一下, show()
, 相信大家都對他的參數非常熟悉了!
但當我們今天想要調整的不是看到的筆數,而是column欄位的數量時,這時候就可以使用select()
情境說明:
可以用來選取需要的column,不論是在處理資料或是確認data確認細節很有幫助!
rdd = sc.parallelize([("drink", 2, "Carmen",23),("movie", 2, "Carmen",23), ("writing", 2, "Don José",25), ("sleep", 2, "Escamillo",30),("play", 2, "Escamillo",30)])
df = rdd.toDF(["Thing", "Hour", "Name","Age"])
df.show()
df.select(df.Name,df.Age).show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+
| Thing|Hour| Name|age|
+-------+----+---------+---+
| drink| 2| Carmen| 23|
| movie| 2| Carmen| 23|
|writing| 2| Don José| 25|
| sleep| 2|Escamillo| 30|
| play| 2|Escamillo| 30|
+-------+----+---------+---+
+---------+---+------------+Original Data+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.select(df.Name,df.Age).show()
+---------+---+
| Name|Age|
+---------+---+
| Carmen| 23|
| Carmen| 23|
| Don José| 25|
|Escamillo| 30|
|Escamillo| 30|
+---------+---+
+---------+---+------------+OUTPUT+---------+---+------------+
'''
select('*')
- 多來點選擇吧?這個也非常簡單易懂啦XD
來個簡單的小說明相信大家就懂了
pyspark : df.show()
= df.select('*').show()
SQL : select * from table
情境說明:
當你需要所有欄位時,不可能一個一個打吧
所以小秘招又出現了 - 也就是select('*')
rdd = sc.parallelize([("drink", 2, "Carmen",23),("movie", 2, "Carmen",23), ("writing", 2, "Don José",25), ("sleep", 2, "Escamillo",30),("play", 2, "Escamillo",30)])
df = rdd.toDF(["Thing", "Hour", "Name","Age"])
df.show()
df.select('*').show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+
| Thing|Hour| Name|age|
+-------+----+---------+---+
| drink| 2| Carmen| 23|
| movie| 2| Carmen| 23|
|writing| 2| Don José| 25|
| sleep| 2|Escamillo| 30|
| play| 2|Escamillo| 30|
+-------+----+---------+---+
+---------+---+------------+Original Data+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.select('*').show()
+-------+----+---------+---+
| Thing|Hour| Name|Age|
+-------+----+---------+---+
| drink| 2| Carmen| 23|
| movie| 2| Carmen| 23|
|writing| 2| Don José| 25|
| sleep| 2|Escamillo| 30|
| play| 2|Escamillo| 30|
+-------+----+---------+---+
+---------+---+------------+OUTPUT+---------+---+------------+
'''
select()
- 選擇+計算?情境說明:
當你需要計算一下的時候XD
rdd = sc.parallelize([("drink", 2, "Carmen",23),("movie", 2, "Carmen",23), ("writing", 2, "Don José",25), ("sleep", 2, "Escamillo",30),("play", 2, "Escamillo",30)])
df = rdd.toDF(["Thing", "Hour", "Name","Age"])
df.show()
df.select(df.Name, (df.Age + 10).alias('age')).show()
df.select(mean('Age'),avg('Age'),sum('Age'),max('Age'),min('Age')).show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+
| Thing|Hour| Name|age|
+-------+----+---------+---+
| drink| 2| Carmen| 23|
| movie| 2| Carmen| 23|
|writing| 2| Don José| 25|
| sleep| 2|Escamillo| 30|
| play| 2|Escamillo| 30|
+-------+----+---------+---+
+---------+---+------------+Original Data+---------+---+------------+
+---------+---+------------+OUTPUT[1]+---------+---+------------+
df.select(df.Name, (df.Age + 10).alias('age')).show()
+---------+---+
| Name|age|
+---------+---+
| Carmen| 33|
| Carmen| 33|
| Don José| 35|
|Escamillo| 40|
|Escamillo| 40|
+---------+---+
+---------+---+------------+OUTPUT[1]+---------+---+------------+
+---------+---+------------+OUTPUT[2]+---------+---+------------+
df.select(mean('Age'),avg('Age'),sum('Age'),max('Age'),min('Age')).show()
+--------+--------+--------+--------+--------+
|avg(Age)|avg(Age)|sum(Age)|max(Age)|min(Age)|
+--------+--------+--------+--------+--------+
| 26.2| 26.2| 131| 30| 23|
+--------+--------+--------+--------+--------+
+---------+---+------------+OUTPUT[2]+---------+---+------------+
'''
總而言之,這個大概是基本的選取資料的方式, 後面的時間還會使用到很多關於select()
的combo技!
selectExpr()
我個人認為這個其實與select()
是一樣的,只是看個人使用習慣決定!
當然也可以使出各種combo技!
情境說明:
SQL語法熟悉愛好者,會非常容易上手!
rdd = sc.parallelize([("drink", 2, "Carmen",23),("movie", 2, "Carmen",23), ("writing", 2, "Don José",25), ("sleep", 2, "Escamillo",30),("play", 2, "Escamillo",30)])
df = rdd.toDF(["Thing", "Hour", "Name","Age"])
df.show()
df.selectExpr("sum(case when Age >= 25 then 1 else 0 end) as ageOver25").show()
df.selectExpr('mean(Age)','avg(Age)','sum(Age)','max(Age)','min(Age)').show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+
| Thing|Hour| Name|age|
+-------+----+---------+---+
| drink| 2| Carmen| 23|
| movie| 2| Carmen| 23|
|writing| 2| Don José| 25|
| sleep| 2|Escamillo| 30|
| play| 2|Escamillo| 30|
+-------+----+---------+---+
+---------+---+------------+Original Data+---------+---+------------+
+---------+---+------------+OUTPUT[1]+---------+---+------------+
df.selectExpr("sum(case when Age >= 25 then 1 else 0 end) as ageOver25").show()
+---------+
|ageOver25|
+---------+
| 3|
+---------+
+---------+---+------------+OUTPUT[1]+---------+---+------------+
+---------+---+------------+OUTPUT[2]+---------+---+------------+
df.selectExpr('mean(Age)','avg(Age)','sum(Age)','max(Age)','min(Age)').show()
+---------+--------+--------+--------+--------+
|mean(Age)|avg(Age)|sum(Age)|max(Age)|min(Age)|
+---------+--------+--------+--------+--------+
| 26.2| 26.2| 131| 30| 23|
+---------+--------+--------+--------+--------+
+---------+---+------------+OUTPUT[2]+---------+---+------------+
'''
其實兩種方法都可以達到你想要的效果,會使用哪種主要就是看個人習慣,這個部分官方文件也能夠找到蛛絲馬跡!
如果有任何不理解、錯誤或其他方法想分享的話,歡迎留言給我!喜歡的話,也歡迎按讚訂閱!
我是 Vivi,一位在雲端掙扎的資料工程師!我們下一篇文章見!Bye Bye~
【本篇文章將同步更新於個人的 Medium,期待與您的相遇!】